Deep neural networks (DNNs) are vulnerable to a class of attacks called "backdoor attacks", which create an association between a backdoor trigger and a target label the attacker is interested in exploiting. A backdoored DNN performs well on clean test images, yet persistently predicts an attacker-defined label for any sample in the presence of the backdoor trigger. Although backdoor attacks have been extensively studied in the image domain, there are very few works that explore such attacks in the video domain, and they tend to conclude that image backdoor attacks are less effective in the video domain. In this work, we revisit the traditional backdoor threat model and incorporate additional video-related aspects to that model. We show that poisoned-label image backdoor attacks could be extended temporally in two ways, statically and dynamically, leading to highly effective attacks in the video domain. In addition, we explore natural video backdoors to highlight the seriousness of this vulnerability in the video domain. And, for the first time, we study multi-modal (audiovisual) backdoor attacks against video action recognition models, where we show that attacking a single modality is enough for achieving a high attack success rate.
translated by 谷歌翻译
The existence of label noise imposes significant challenges (e.g., poor generalization) on the training process of deep neural networks (DNN). As a remedy, this paper introduces a permutation layer learning approach termed PermLL to dynamically calibrate the training process of the DNN subject to instance-dependent and instance-independent label noise. The proposed method augments the architecture of a conventional DNN by an instance-dependent permutation layer. This layer is essentially a convex combination of permutation matrices that is dynamically calibrated for each sample. The primary objective of the permutation layer is to correct the loss of noisy samples mitigating the effect of label noise. We provide two variants of PermLL in this paper: one applies the permutation layer to the model's prediction, while the other applies it directly to the given noisy label. In addition, we provide a theoretical comparison between the two variants and show that previous methods can be seen as one of the variants. Finally, we validate PermLL experimentally and show that it achieves state-of-the-art performance on both real and synthetic datasets.
translated by 谷歌翻译
Adversarial training is an effective approach to make deep neural networks robust against adversarial attacks. Recently, different adversarial training defenses are proposed that not only maintain a high clean accuracy but also show significant robustness against popular and well studied adversarial attacks such as PGD. High adversarial robustness can also arise if an attack fails to find adversarial gradient directions, a phenomenon known as `gradient masking'. In this work, we analyse the effect of label smoothing on adversarial training as one of the potential causes of gradient masking. We then develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA). Our attack approach is based on a `match and deceive' loss that finds optimal adversarial directions through guidance from a surrogate model. Our modified attack does not require random restarts, large number of attack iterations or search for an optimal step-size. Furthermore, our proposed G-PGA is generic, thus it can be combined with an ensemble attack strategy as we demonstrate for the case of Auto-Attack, leading to efficiency and convergence speed improvements. More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
translated by 谷歌翻译
Although existing semi-supervised learning models achieve remarkable success in learning with unannotated in-distribution data, they mostly fail to learn on unlabeled data sampled from novel semantic classes due to their closed-set assumption. In this work, we target a pragmatic but under-explored Generalized Novel Category Discovery (GNCD) setting. The GNCD setting aims to categorize unlabeled training data coming from known and novel classes by leveraging the information of partially labeled known classes. We propose a two-stage Contrastive Affinity Learning method with auxiliary visual Prompts, dubbed PromptCAL, to address this challenging problem. Our approach discovers reliable pairwise sample affinities to learn better semantic clustering of both known and novel classes for the class token and visual prompts. First, we propose a discriminative prompt regularization loss to reinforce semantic discriminativeness of prompt-adapted pre-trained vision transformer for refined affinity relationships. Besides, we propose a contrastive affinity learning stage to calibrate semantic representations based on our iterative semi-supervised affinity graph generation method for semantically-enhanced prompt supervision. Extensive experimental evaluation demonstrates that our PromptCAL method is more effective in discovering novel classes even with limited annotations and surpasses the current state-of-the-art on generic and fine-grained benchmarks (with nearly $11\%$ gain on CUB-200, and $9\%$ on ImageNet-100) on overall accuracy.
translated by 谷歌翻译
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies, compared to the local convolutional-based design. However, the self-attention operation has quadratic complexity which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters and compute cost. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features using a pair of inter-dependent branches based on spatial and channel attention. Our spatial attention formulation is efficient having linear complexity with respect to the input sequence length. To enable communication between spatial and channel-focused branches, we share the weights of query and key mapping functions that provide a complimentary benefit (paired attention), while also reducing the overall network parameters. Our extensive evaluations on three benchmarks, Synapse, BTCV and ACDC, reveal the effectiveness of the proposed contributions in terms of both efficiency and accuracy. On Synapse dataset, our UNETR++ sets a new state-of-the-art with a Dice Similarity Score of 87.2%, while being significantly efficient with a reduction of over 71% in terms of both parameters and FLOPs, compared to the best existing method in the literature. Code: https://github.com/Amshaker/unetr_plus_plus.
translated by 谷歌翻译
Large-scale multi-modal training with image-text pairs imparts strong generalization to CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack in generalization aspect. This begs the following question: How to effectively transfer image-level CLIP representations to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that the frame-level processing from CLIP image-encoder followed by feature pooling and similarity matching with corresponding text embeddings helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a `bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on language and vision side to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.
translated by 谷歌翻译
We propose a very fast frame-level model for anomaly detection in video, which learns to detect anomalies by distilling knowledge from multiple highly accurate object-level teacher models. To improve the fidelity of our student, we distill the low-resolution anomaly maps of the teachers by jointly applying standard and adversarial distillation, introducing an adversarial discriminator for each teacher to distinguish between target and generated anomaly maps. We conduct experiments on three benchmarks (Avenue, ShanghaiTech, UCSD Ped2), showing that our method is over 7 times faster than the fastest competing method, and between 28 and 62 times faster than object-centric models, while obtaining comparable results to recent methods. Our evaluation also indicates that our model achieves the best trade-off between speed and accuracy, due to its previously unheard-of speed of 1480 FPS. In addition, we carry out a comprehensive ablation study to justify our architectural design choices.
translated by 谷歌翻译
在计算机视觉领域,异常检测最近引起了越来越多的关注,这可能是由于其广泛的应用程序从工业生产线上的产品故障检测到视频监视中即将发生的事件检测到在医疗扫描中发现病变。不管域如何,通常将异常检测构架为一级分类任务,其中仅在正常示例上进行学习。整个成功的异常检测方法的家庭基于学习重建掩盖的正常输入(例如贴片,未来帧等),并将重建误差的幅度作为异常水平的指标。与其他基于重建的方法不同,我们提出了一种新颖的自我监督蒙面的卷积变压器块(SSMCTB),该卷积变压器块(SSMCTB)包括基于重建的功能在核心架构层面上。拟议的自我监督块非常灵活,可以在神经网络的任何层上掩盖信息,并与广泛的神经体系结构兼容。在这项工作中,我们扩展了以前的自我监督预测性卷积专注块(SSPCAB),并具有3D掩盖的卷积层,以及用于频道注意的变压器。此外,我们表明我们的块适用于更广泛的任务,在医学图像和热视频中添加异常检测到基于RGB图像和监视视频的先前考虑的任务。我们通过将SSMCTB的普遍性和灵活性整合到多个最先进的神经模型中,以进行异常检测,从而带来了经验结果,可以证实对五个基准的绩效改进:MVTEC AD,BRATS,BRATS,Avenue,Shanghaitech和Thermal和Thermal和Thermal罕见事件。我们在https://github.com/ristea/ssmctb上发布代码和数据作为开源。
translated by 谷歌翻译
现有的基于深度学习的3D对象检测器通常依赖于单个对象的外观,并且不明确注意场景的丰富上下文信息。在这项工作中,我们为3D对象检测(CMR3D)框架提出了上下文化的多阶段完善,该框架将3D场景作为输入,并努力在多个级别上明确整合场景的有用上下文信息,以预测一组对象界限盒以及它们相应的语义标签。为此,我们建议利用一个上下文增强网络,该网络在不同级别的粒度级别上捕获上下文信息,然后是多阶段修补模块,以逐步完善框位置和类预测。大规模ScannETV2基准测试的广泛实验揭示了我们提出的方法的好处,从而使基线的绝对提高了2.0%。除3D对象检测外,我们还研究了CMR3D框架在3D对象计数问题上的有效性。我们的源代码将公开发布。
translated by 谷歌翻译
在过去的十年中,基于深度学习的算法在遥感图像分析的不同领域中广泛流行。最近,最初在自然语言处理中引入的基于变形金刚的体系结构遍布计算机视觉领域,在该字段中,自我发挥的机制已被用作替代流行的卷积操作员来捕获长期依赖性。受到计算机视觉的最新进展的启发,遥感社区还见证了对各种任务的视觉变压器的探索。尽管许多调查都集中在计算机视觉中的变压器上,但据我们所知,我们是第一个对基于遥感中变压器的最新进展进行系统评价的人。我们的调查涵盖了60多种基于变形金刚的60多种方法,用于遥感子方面的不同遥感问题:非常高分辨率(VHR),高光谱(HSI)和合成孔径雷达(SAR)图像。我们通过讨论遥感中变压器的不同挑战和开放问题来结束调查。此外,我们打算在遥感论文中频繁更新和维护最新的变压器,及其各自的代码:https://github.com/virobo-15/transformer-in-in-remote-sensing
translated by 谷歌翻译